I. OVERVIEW


This paper is the final report of a four-year project which examined tailored
testing from a non-parametric perspective. Its most distinctive characteristic
is the use of the Guttman scale as an ideal of the test setting. The goal of
testing from such a viewpoint is a mutual ordering of examinees and items along
a hypothetical ability dimension, making use of only simple counting procedures
to accomplish the task. This method is called tailored testing using implied
orders, and in the current project the development of these procedures
progressed along three fronts.

(1) Producing and evaluating computer programs to accomplish
    implied orders tailored testing.

(2) Developing principles of ordinal measurement which will
    provide a theoretical basis for tailored testing and its
    evaluation.

(3) Suggesting methods for carrying out and evaluating
    research in tailored testing.

Although  this  third  aspect  of  our  work  has  more  often  been  implicit  rather 
than  explicit,  we  believe  it  may  be  an  important  contribution  for  the  field 
in  its  own  right. 

These three general concerns, namely the development of computer algorithms
for testing, the advancement of ordinal techniques in test theory, and the
delineation of a research methodology for tailored testing, have guided our
research. The report which follows is a summary of our current thinking in
each of these areas, and includes many conclusions and suggestions for future
research. It is our overall impression that the method developed so far has
much to recommend it, although after only three years a great deal of additional
work remains to be done.

It will be convenient to present this material in several major sections.
First, a comprehensive review will be given of the research carried out prior
to this paper, after which will follow a report of recent data, and finally an
evaluation and prospectus.


II.  Summary  of  Past  Research 


This section briefly summarizes the progress made on Implied Orders testing
up to recent times, includes a short review of our research in several
areas, and sketches the rationale behind the methods.


Review  of  the  TAILOR  System 


The test tailoring system developed here has been called TAILOR. It is
a simultaneous procedure for acquiring information about person and item orders
which, unlike all other methods of tailoring tests, treats persons and items as
elements of a joint order. It gathers item information concurrently with the
administration of the tailored tests, and thus eliminates the need for a block
of examinees to be pretested on a complete item pool. This elimination of
pretesting may one day bring tailored testing within reach of test givers who
cannot afford to administer thousands of tests as a preliminary to actual
tailoring.

The basic principle is a rather simple one. Dichotomous items provide
only two kinds of information. Either an item is missed and therefore, in the
terminology used here, dominates a person, or it is answered correctly and the
person dominates the item. These two relations can be depicted pictorially.


Generally we have discussed the mathematics of TAILOR in matrix terminology,
but as has been stated elsewhere (Cliff, Note 2) these matrices are
merely the formalization of directed graphs such as the one shown below.


Each solid arrow represents an observed correct (P → I) or incorrect
(I → P) answer on a hypothetical test. When placed in a reasonable order such
as shown below, the above graph demonstrates how the observed relations between
persons and items can first be used to give an order among items or among
persons and then among other person-item pairs which have not yet been observed.


[Figure: directed graph with the persons (P) and items (I) arranged in order
along a single line.]


When presented in this manner it should also be apparent that if a consistent
pattern of dominances exists between nearly adjacent elements, then this also
implies the nature of the relations between elements which are located farther
apart. It is this ability to imply distant, unobserved relations that makes
test tailoring possible.
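The implication step can be illustrated with a minimal sketch. It assumes
errorless data, so that dominance is strictly transitive; TAILOR itself weighs
conflicting evidence statistically, which is not shown here. The element labels
and the function name `implied_relations` are hypothetical.

```python
def implied_relations(observed):
    """Return the dominance pairs implied by transitivity.

    `observed` is a set of (x, y) pairs meaning "x dominates y":
    (item, person) when the person missed the item, and
    (person, item) when the person answered it correctly.
    Assumes errorless data (no contradictory relations).
    """
    closure = set(observed)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                # a dominates b and b dominates d, so a dominates d
                if b == c and a != d and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure - set(observed)


# Hypothetical chain of four observed responses
observed = {("I1", "P1"), ("P1", "I2"), ("I2", "P2"), ("P2", "I3")}
implied = implied_relations(observed)

# Person-item pairs whose outcome is implied without being administered
unobserved_person_item = sorted(
    pair for pair in implied if pair[0][0] != pair[1][0]
)
```

In this toy chain, four observed responses settle all six person-item pairs,
which is the sense in which implied relations reduce the number of items that
must actually be presented.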

If one long arrow is added from the leftmost element to the rightmost element,
as indicated by the dashes, the overall order relationships are not much
enhanced. Every test user knows that it is of doubtful utility to pair the
brightest examinee with the easiest questions. But the situation for closely
matched persons and items is likely to be less clear cut and therefore much more
informative. When a welter of relationships indicates that no simple order
exists, then a statistical evaluation of the weight of evidence may serve to
clarify the picture. Just such an approach is taken by TAILOR, and the details
have been described elsewhere (Cliff, Cudeck and McCormick, Note 4; McCormick,
Note 6).

It should be emphasized that although the persons and items are already
ordered in the directed graph, in a matrix they need not be ordered for the
implication process to take place and in fact are not formally ordered until the
end of testing. So, although formal pretesting is not necessary, TAILOR does
require responses from people to order its items and thereby to also order its
people. The operation of this process occurs in two possible ways which
correspond to two separate versions of TAILOR: a FORTRAN group testing version
(Cudeck, Cliff and Kehoe, 1977) and an individual testing version written in
APL (McCormick and Cliff, 1977).

In the group testing approach, several examinees begin at once to attempt
a random selection of items. From a very few initial responses the order begins
to take shape, and tailoring can occur to varying degrees depending on the size
of the item pool and the number of examinees. No one in the group version is
given a complete set of items and there is no problem of item security caused
by early examinees discussing the test with later examinees.

The individual testing version gives complete tests one at a time until
enough information exists to begin eliminating the most extreme items from the
tailored tests. In the tests given to date with this second method, the number
of persons tested before tailoring begins has been invariably two. As more
examinations are given the extent of tailoring increases until finally an
asymptotic value is reached. In the tests given, this seems to occur near
fifteen examinations.


Review  of  Previous  Evaluations 

Four types of evaluations have been carried out on the Implied Orders model
to date. Rather complete reviews of this research have been presented previously
(Cliff, Cudeck, and McCormick, Note 3; Note 4), but a brief summary will also be
given here.

Monte  Carlo  Results 

A Monte Carlo study with errorless data was the first step in the evaluation
of TAILOR. The group testing program was used for this project, the details of
which have been given (Cudeck, Cliff, Reynolds and McCormick, Note 5). With
errorless data this Monte Carlo demonstrated that TAILOR was able to perfectly
order the persons and items, and needed only about 50% of the responses.
Departures from a complete order only occurred in those instances where the
"true scores" assigned to two persons (or two items) were so close together that
no item true score fell in between. Knuth (1973) has indicated that the minimum
number of bits of information required to sort a set of n elements is
$\log_2 n!$. It is interesting that TAILOR performed a task similar to sorting
elements with nearly the minimum number of responses given by the theoretical
expectation.
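For a sense of scale, the bound $\log_2 n!$ can be evaluated directly. The
sketch below is illustrative only: the sample size is hypothetical, and treating
each dichotomous response as one binary comparison is a loose analogy rather
than the report's own analysis.

```python
import math


def min_sort_bits(n):
    """Information-theoretic minimum number of binary outcomes needed
    to single out one of the n! possible orders: log2(n!)."""
    return math.log2(math.factorial(n))


# Hypothetical example: 10 persons and 15 items placed in one joint order
n_persons, n_items = 10, 15
bound = min_sort_bits(n_persons + n_items)

# Number of responses in the complete person-by-item matrix
total_responses = n_persons * n_items
```

With these hypothetical sizes the bound is a little under 84 bits against 150
possible responses, so a procedure using about half of the responses operates
in the neighborhood of the theoretical minimum.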

A second type of Monte Carlo was carried out which used a more realistic
kind of data (Cliff, Cudeck and McCormick, Note 4). With the Birnbaum (1968)
4-parameter logistic model, various sizes of response matrices were generated
which were based on specified item and person characteristics. Samples included
10, 25 or 40 "persons" and 15 or 25 "items"; each had at least 5 replications
and most were based on several sets of latent parameters. As in the errorless
case, the number of responses required varied inversely with sample size, but
averaged around 50%. With this kind of data model, the correlation of true score
with test score is a validity in the classical test theory sense. By far the
most significant effect on tailored test validity is complete test validity
(the correlation of ability with complete test score). Tailored validity was
close to complete test validity, although a few points lower. However, the gap
between the two widened as complete test validity decreased: tailored test
validity fell off more rapidly.

Simulation  Studies 

A file of data consisting of the responses to 122 Stanford-Binet items
by 622 children was used in the next evaluation (Cliff, Cudeck, and McCormick,
Note 4). The children ranged in age from 2 years to just under 15 years, and
all together had an average IQ of 117.3. The sample was divided into 3 age
groups and simulations took place on the groups separately. The method used here
is especially important for tailored testing research and deserves careful
presentation. For any age group a random sample of subjects was selected, and
for this sample the items were randomly split into two halves. The correlation
between scores on the two halves was used to compute a reliability which was
called Complete-Complete parallel forms reliability. One of these matrices was
then input to the testing program, which used the data as the basis for a
simulation, operating in a manner similar to the previous Monte Carlos. The
correlation between the score on the tailored matrix and the score from the
other half of the data produced a Complete-Tailored parallel forms reliability.
The comparison of Complete-Complete with Complete-Tailored reliabilities was the
primary means of comparison used here, and it is the most straightforward method
of evaluation possible (see Cliff, Cudeck and McCormick, Note 3). As with the
fallible Monte Carlo data, about half the responses were used, the
Complete-Tailored reliability was close to Complete-Complete, and it was quite
dependent upon the Complete-Complete reliability, irrespective of age.
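The Complete-Complete reliability described above can be sketched as follows.
This is a minimal illustration on a hypothetical response matrix, not the
project's program; the function names and the fixed random seed are assumptions.
A Complete-Tailored reliability would be obtained the same way, with the scores
on one half replaced by the scores produced by the tailoring simulation.

```python
import random


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def split_half_scores(responses, seed=0):
    """Randomly split the items into two halves and return each
    person's number-correct score on each half."""
    n_items = len(responses[0])
    order = list(range(n_items))
    random.Random(seed).shuffle(order)
    half_a, half_b = order[: n_items // 2], order[n_items // 2:]
    scores_a = [sum(row[i] for i in half_a) for row in responses]
    scores_b = [sum(row[i] for i in half_b) for row in responses]
    return scores_a, scores_b


# Hypothetical 5-person, 6-item response matrix (1 = correct)
responses = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
scores_a, scores_b = split_half_scores(responses)
complete_complete = pearson(scores_a, scores_b)
```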

Individual Testing with Human Subjects

The above Monte Carlos were all carried out with the group testing program.
McCormick (Note 6) developed an individual approach, and produced the first
data based on human subjects. Using anagrams, he found that the first few
subjects took nearly the entire test, but that subsequent ones needed fewer and
fewer items, reaching an asymptote of about 9 questions. More important, the
parallel forms reliability for two tailored tests was actually higher than the
reliability for an independent sample who took the same items as two complete
tests.



Summary 

Table 1 gives a much distilled summary of the research done to date.
The various methods of evaluating performance are listed across the top of
the table and the four different designs are given as rows. Although slight
variations exist between outcome measures, the consistent philosophy has been
to compare tailored testing performance with an independent sample of data.
The various outcome measures in the last column are quite high, ranging from
.93 to 1.07, while the percentage of items used went from 44 to 55. The
summary findings are uniformly positive and suggest that the Implied Orders
model may be useful for several kinds of tailored testing requirements. It
should be indicated, however, that these results are averages, that sometimes
the method can be expected to perform better, but sometimes it may actually
do worse. Of most concern to us is the fact that the accuracy of a tailored
testing session is a direct function of the quality of the original data.
Since this method begins with no information about either subjects or items,
it is especially susceptible to producing tailored results that are invalid
as a consequence of poor data. Our tentative evaluations regarding the testing
algorithms are linked to an evaluation of the quality of data, such that
the better the data, the better the performance has been.


Consistency  Evaluation  and  the  Identification 
of  Unidimensional  Orders 


Consistency  Measurement 

In the context of Implied Orders, test items provide dominance information.
The number of times an item dominates another item is readily found by
pre-multiplying the rights matrix by the transpose of the wrongs matrix.
Correspondingly, the number of times one person dominates another is found by
carrying out that same multiplication in the reverse order. If the dominance
matrix is consistent, it is highly asymmetric, and several measures of such
consistency were proposed (Cliff, Note 2; Cliff, 1977).
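The two products can be written out for a small example. The sketch below
assumes complete data, so that the wrongs matrix is simply the complement of
the rights matrix (with tailored data, unadministered cells would be zero in
both). The matrix values and helper names are hypothetical.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


# Hypothetical rights matrix: rows are persons, columns are items
rights = [
    [1, 1, 1],
    [0, 1, 1],
    [0, 0, 1],
]
# Complete data assumed: every response is either right or wrong
wrongs = [[1 - x for x in row] for row in rights]

# item_dom[j][k]: number of persons who missed item j but passed item k,
# i.e. the number of times item j dominates item k
item_dom = matmul(transpose(wrongs), rights)

# person_dom[p][q]: number of items person p passed and person q missed,
# i.e. the number of times person p dominates person q
person_dom = matmul(rights, transpose(wrongs))
```

Because this toy matrix is a perfect Guttman scale, both dominance matrices
have all of their nonzero entries on one side of the diagonal, which is the
extreme asymmetry the consistency measures are designed to detect.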

One index counts the number of dominance relations that are above the
diagonal when the item-item matrix is in difficulty order, divides it by the
total number, and transforms this to a more reasonable scale (see Appendix).
A second is more sophisticated. It compares the number of dominance relations to
the number that would occur if the items were independent. The symmetric
way in which the theory treats persons and items allows parallel indices for
consistency of person dominance relations, although the two types are not
directly related except at the extremes.

Multidimensionality

One promising answer to the need for pretesting may lie in the derivation
of ordinal measures of item consistency, used in conjunction with on-line
tailored testing in real time. What we have had in mind is similar to a factor
analysis of dichotomous items, the need for which has not escaped other
researchers (e.g., Bock, Note 1, p. 47). The first step toward this goal was
taken by Reynolds (Note 7), who outlined a method of constructing a
unidimensional sample of items from a larger heterogeneous population. The basic
proposition in this work is that the Guttman scale can serve as a model for
item selection if one has a means of assessing the degree to which internal
consistency is improved by the addition of candidate items. Reynolds applied
a measure of consistency which has been advanced by Cliff (Note 2, 1977), and
his results were quite encouraging. Of course his work does not include an
application to the testing setting in exactly the form called for here, but
it is important as a first step toward that goal.

However, the usefulness of these ideas of internal consistency is not
limited to construction of item pools. In fact the primary advantage of such
an index is as an omnibus measure of overall quality of a dataset. Since
adaptive testing research demonstrates such a direct relation between the
quality of test items and the accuracy of the examinee's scores, it is apparent
that consistency measures can provide information which is vital to the
understanding of testing data.

SECTION  III.  PREVIOUSLY  UNREPORTED  RESEARCH 

Consistency  Index  Monte  Carlo 

The indices which have been proposed actually have quite broad application,
and in the testing field may be used in cases of both complete and incomplete
data. Two of these indices, $c_{t2}$ and $c_{t3}$, seem especially promising
for adaptive testing. Because the measures are of such recent origin and have
not had extensive use outside this project, a brief study of their behavior
under known conditions of data seemed appropriate. In the following Monte
Carlo the performance of some standard testing statistics was compared to
$c_{t2}$ and $c_{t3}$ with complete data. The Appendix provides a thorough
discussion of computational procedures for both measures and also gives a
worked example.

Method 

The  following  2-phase  procedure  was  used  in  this  study: 

(1) A theoretical response matrix of 300 persons and 200 items,
    with ability and difficulty distributional parameters fixed,
    chance values set to zero and mean discrimination equal to
    $\mu_a$, was generated according to the Birnbaum (1968) model.

(2) A sample matrix of p persons and i items was randomly selected
    from the larger matrix. Values of KR20, $c_{t2}$, $c_{t3}$, $v_r$ and
    $v_\tau$ were computed, where

    $v_r$ = the Pearson correlation between ability
            true scores and obtained number correct
            in sample (i.e., validity)

    $v_\tau$ = same as $v_r$, only Kendall's tau is used
            instead of Pearson r.

Step (2) was executed 10 times to get mean scores on the above statistics. These
means became the values for KR20, $c_{t2}$, etc. which are later reported.
Steps (1) and (2) were performed for a range of $\mu_a$ from 0.2 to 2.0.
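For readers unfamiliar with KR20, the standard textbook form is
$\mathrm{KR20} = \frac{k}{k-1}\left(1 - \frac{\sum_i p_i q_i}{\sigma_X^2}\right)$,
where k is the number of items, $p_i$ the proportion passing item i,
$q_i = 1 - p_i$, and $\sigma_X^2$ the variance of total scores. The sketch below
computes it for a small hypothetical matrix. The report does not say which
variance convention was used; the divide-by-N form here is an assumption.

```python
def kr20(responses):
    """Kuder-Richardson formula 20 for a persons x items 0/1 matrix.

    Assumes the population (divide-by-N) variance of total scores.
    """
    n_persons = len(responses)
    k = len(responses[0])
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n_persons
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_persons
    sum_pq = 0.0
    for i in range(k):
        p = sum(row[i] for row in responses) / n_persons
        sum_pq += p * (1 - p)
    return (k / (k - 1)) * (1 - sum_pq / var_total)


# Hypothetical 4-person, 3-item matrix forming a perfect Guttman scale
data = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
reliability = kr20(data)
```

Note that even this perfectly consistent toy matrix yields a KR20 below 1,
because with only three items the statistic is bounded by test length as well
as by consistency.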

The fixed ability and difficulty parameters used here described two
theoretical tests. The first is an optimal test situation, i.e., with ability
and item difficulty having the same parameters, $\mu_b = \mu_\theta = 0$ and
$\sigma_b = \sigma_\theta = 1$, both from normal populations. As usual, b and
$\theta$ describe item difficulty and person ability, respectively. The second
situation represents a broad range difficulty test which is quite difficult,
with $\mu_b = 1$, $\sigma_b = 2$. Henceforth these theoretical populations will
be designated as "Optimal" and "Difficult".
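Phase (1) of the procedure can be sketched as below. This is an illustration
under stated assumptions, not the program used in the study: every item is
given the same discrimination (the report fixes only its mean), no scaling
constant is applied inside the logistic function, and the function names and
seed are hypothetical. Ability and difficulty are drawn from normal
distributions with mean 0 and standard deviation 1, as in the first test
situation, and the chance parameter is zero.

```python
import math
import random


def p_correct(theta, a, b, c=0.0):
    """Logistic item response function with discrimination a,
    difficulty b and chance-success parameter c."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def generate_matrix(n_persons, n_items, mu_a, mu_b=0.0, sigma_b=1.0, seed=0):
    """Generate abilities, difficulties and a 0/1 response matrix.

    Assumption: all items share discrimination mu_a; chance is zero.
    """
    rng = random.Random(seed)
    thetas = [rng.gauss(0.0, 1.0) for _ in range(n_persons)]
    bs = [rng.gauss(mu_b, sigma_b) for _ in range(n_items)]
    matrix = [
        [1 if rng.random() < p_correct(t, mu_a, b) else 0 for b in bs]
        for t in thetas
    ]
    return thetas, bs, matrix


# 300 persons by 200 items, as in the study, at one discrimination level
thetas, bs, matrix = generate_matrix(300, 200, mu_a=1.0)
```

Phase (2) would then draw random subsets of rows and columns from `matrix` and
compute the statistics of interest on each subset.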

Results 

Table 2 and Figure 1 show the values for KR20, $c_{t2}$, $c_{t3}$, $v_r$ and
$v_\tau$ (the metric and ordinal validities) as a function of item
discrimination for the Optimal tests. As can be seen, the two indices of
validity are moderate for the lowest values of discrimination, and begin to
reach asymptote for $\mu_a = .80$. As expected, KR20 is about $v_r^2$,
asymptoting slightly later than $v_r$. On the other hand, both $c_{t2}$ and
$c_{t3}$ increase monotonically across discrimination, and $c_{t3}$ appears
nearly linear as a function of $\mu_a$.


TABLE 2: Values of c_t2, c_t3, KR20 and validity as a function
of discrimination for two populations

Optimal Test Population

    mu_a     c_t2     c_t3     KR20     v_r      v_tau
     .2      -.54      .03      .34     .59      .43
     .4      -.22      .11      .62     .79      .62
     .6       .04      .24      .78     .88      .72
     .8       .16      .35      .85     .92      .79
    1.0       .32      .47      .89     .93      .81
    1.2       .44      .53      .90     .94      .82
    1.4       .54      .60      .91     .94      .84
    1.6       .74      .68      .91     .95      .84
    1.8       .75      .77      .93     .96      .88
    2.0       .76      .79      .94     .96      .88

Difficult Test Population

    mu_a     c_t2     c_t3     KR20     v_r      v_tau
     .2      -.34      .04      .39     .63      .46
     .4       .19      .10      .53     .76      .59
     .6       .49      .24      .71     .86      .71
     .8       .67      .36      .77     .87      .72
    1.0       .74      .46      .81     .90      .76
    1.2       .82      .50      .79     .90      .78
    1.4       .83      .66      .87     .93      .82
    1.6       .89      .69      .85     .93      .82
    1.8       .92      .72      .84     .92      .80
    2.0       .94      .74      .83     .93      .83

Figure 2 contains the results for the Difficult test. Again, validity
begins at a moderate .4 or .6 and asymptotes around .8. KR20 behaves as before,
although it more closely follows the rank order validity. One of the
consistency indices also displays an asymptote, but it occurs much later, near
$\mu_a = 1.4$. The other shows a regular, monotonic increase which is actually
greater than the metric validity at 2.0. For $\mu_a > 1.6$ it also exceeds KR20.

Discussion 

These data suggest several tentative conclusions regarding these consistency
indices as they relate to validity and KR20. Most apparent is that $c_{t2}$
and $c_{t3}$ are conservative if compared to KR20. The different measures do not
become even remotely similar until highly discriminating items are used. Thus
when $\mu_a = .4$, KR20 = .62, $c_{t2}$ = -.22 and $c_{t3}$ = .11 (for Optimal
data). This is especially important to consider in view of the fact that
estimates of discrimination greater than 1.0 are rare in real test data. Thus
$c_{t3}$ in the range .25 to .35 corresponds to acceptable validity.

Second, as the test departs from ideal conditions, $c_{t3}$ in particular
behaves fairly regularly across the values of $\mu_a$. From these data it
appears that the ordinal measures are somewhat less susceptible to departures
from ideal conditions than the more metric KR20, and the discrepancy would be
even more drastic if a chance success parameter were allowed.

Thirdly, the data remind us that KR20 is a parallel form reliability estimated
from internal consistency, not an internal consistency measure. As $\mu_a$
increases, the consistency of the data in the usual sense increases also, but
KR20 increases much more quickly and then levels off. In contrast, $c_{t2}$ and
$c_{t3}$ behave somewhat more linearly with $\mu_a$, and only reach large values
as the data approach a Guttman scale.

Seeing such low values of internal consistency may be somewhat disconcerting
to a researcher accustomed to using other measures. Indeed, for the
adaptive testing researcher who goes to lengths to construct batteries of highly
discriminating items, these modest results may appear quite extraordinary.
In fact, we too were surprised at how conservative $c_{t2}$ and $c_{t3}$ are in
many instances. Roughly speaking, the c measures behave more like average
inter-item correlations. However, it should be indicated that the evaluation of
a score matrix by the c's conforms precisely to the ideal model of a Guttman
scale, and it is difficult to imagine any more appropriate basis for evaluation
under nonmetric assumptions than a Guttman simple order. These statistics
provide for the first time a rationale of internal consistency formulated in
terms of ordering theory, which in our view is the correct way to conceptualize
consistency of test items.


Further  Experience  with  Individual  Testing 

The  first  evaluation  of  the  APL  individual  testing  program  was  carried  out 
with  anagram  items  because  of  the  ease  of  computer  scoring.  Such  "free  response" 
items  are  particularly  attractive  because  the  probability  of  guessing  a correct 
answer  at  random  is  negligible.  Although  the  administration  of  tests  by  computer 
may  one  day  bring  the  demise  of  the  multiple  choice  item,  a multiple  choice  item 
pool  was  assembled  for  this  study  because  most  tests  today  use  a multiple  choice 
format. 

Seventy-two general knowledge questions were adapted from the College
Qualification Tests (Note 8). The items were randomly divided according to our
usual practice by the computer into parallel forms. Each item had four possible
answers. The subject matter was diverse, and dealt with such far ranging topics
as law, scientific measurement, electricity, chemistry, physics, mechanical
engineering, geography, political science, history, physiology, economics,
meteorology and agronomy. Subjects were 67 introductory psychology students
alternately assigned to tailored and complete test conditions as they arrived
for testing.

Method 

Individually,  thirty-four  subjects  were  given  the  tailored  version  of  the  two 
36  item  multiple  choice  tests.  First  one  item  was  given  from  pool  A and  then 
the  next  item  came  from  an  independently  tailored  pool  B.  No  information  from 
either  pool  was  ever  used  to  improve  the  tailoring  of  the  other.  When  one  of  the 
tailored  halves  was  finished,  the  remaining  items  were  necessarily  all  from  the 
unfinished  item  pool.  The  complete  test  version  was  given  to  33  examinees  at  the 
computer  terminal  with  A and  B items  simply  alternating. 

The above design produced four sets of scores, A and B tailored scores and
A and B complete test scores. As in the past, the principal analysis consisted
of comparing parallel forms reliabilities for experimental (tailored) and control
(complete) groups. Also of principal interest was the percent of items necessary
to obtain tailored scores.

In addition to reliability comparisons, score distributions were examined
for irregularities, and the sequential nature of the individual testing program
was utilized to compute regression analyses based on the extent of tailoring.
$c_{t2}$ and $c_{t3}$ were also calculated to compare the quality of tailored
and complete test data.

Results 

The testing sessions, as revealed in Figure 3, proceeded much as they did
for the anagram items. The first two tailored subjects answered all items,
and the amount of tailoring increased rapidly over the next dozen
administrations, seeming to approach an asymptote of around 11 or 12 items (out
of 36) by the end of the study. The overall percentage of items presented was 41
for form A and 42 for form B.

The actual response matrices for complete and tailored tests are presented
in Figures 4 and 5 respectively. The items and persons in the tailored matrices
are labelled such that a person's identification number in the sequence of
testing is at the left of his row. Entries at the side and bottom represent
overall net dominance scores for items and persons, respectively. A net
dominance score is the number of persons and items that the specified person or
item dominates minus the total that in turn dominate the person or item.

The Pearson r and $\tau_b$ reliabilities for complete and tailored scores are
given in Table 3. Disappointingly, the tailored reliabilities were r = .26
and $\tau_b$ = .19, whereas the corresponding coefficients were .68 and .57 for
the complete tests. The reliabilities are significantly different at
$\alpha$ = .05.
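The report does not state which test was used for this comparison. One standard
choice for two correlations from independent groups is Fisher's r-to-z
transformation, sketched below under that assumption with the Pearson
coefficients and the group sizes given above (33 complete, 34 tailored). The
function name is hypothetical.

```python
import math


def fisher_z_test(r1, n1, r2, n2):
    """Two-sided test of the difference between two independent
    Pearson correlations using Fisher's r-to-z transformation."""
    z1, z2 = math.atanh(r1), math.atanh(r2)
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    # Two-sided p value from the standard normal distribution
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p


# Complete group: r = .68, n = 33; tailored group: r = .26, n = 34
z, p = fisher_z_test(0.68, 33, 0.26, 34)
```

Under these assumptions the statistic is a little above 2 and the two-sided p
value falls below .05, which agrees with the conclusion stated in the text.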

Tailored test scores corresponding to person-item relations, i.e., the number
right or implied right, were calculated also so that the distribution of
tailored scores could be compared with that for complete test scores. Means
and standard deviations of the score distributions are presented in Table 3.

TABLE 3: Score Means, Standard Deviations, c_t2 and c_t3 values for
Complete and Tailored Data.

                          Complete Data        Tailored Data

    AB reliability
      r                        .68                  .26
      tau                      .57                  .19

    Item Pools               A        B           A        B

    Score Means            21.39    19.97       24.26    20.32
    Standard deviations     5.73     4.54        7.24     6.86
    c_t2                     .04      .10         .32      .32
    c_t3                     .16      .09         .26      .19

FIGURE 4: Complete Test Response Matrices. Rows and columns in
the figures are permuted, independently, so they do not
correspond to the same persons within figures or the
same items between figures.
[Figure legend: separate symbols mark correct answers, incorrect answers,
implied correct answers, and implied errors. Blanks are implications which
were later revoked.]


FIGURE 5: Tailored Test Response Matrices. Individuals can be
identified by the numbers at the left.
t-tests comparing tailored scores computed in this manner with complete test
scores were performed for item pools A and B. The significance levels of the
two comparisons were $\alpha_A$ = .08 and $\alpha_B$ = .80. Table 3 also gives
values for $c_{t2}$ and $c_{t3}$, for tailored and complete tests, and item
pools A and B. As can be seen, the tailored test values are in each case higher
than those for complete tests.

The relationship between the number of items presented in each test and the
number of tests given was shown in Figure 3. To clarify changes in test scores
as a function of the extent of tailoring, correlations were computed between a
variety of score measures and both the serial position of the examinee and the
number of items taken. The resulting correlations are presented in Table 4.

Absolute  differences  in  z scores  of  a particular  individual  on  items  A and 
B were  examined  for  trends  in  score  reliability.  Correlations  with  neither 
serial  position  nor  number  of  items  taken  were  significant.  The  absolute  values 
of  the  z scores  were  also  looked  at  for  possible  trends  in  score  variance.  None 
were  found. 

A significant correlation between z scores in the A item pool and number of
items taken was found (r = -.35). The combined data for A and B items
also produced a significant correlation ($\alpha$ < .05), although the B item
pool correlation was not significant.

Discussion 

The complete test reliability of the items used here is somewhat lower than
the reliability of the anagram items used in the first study. The Pearson $r$ and
$\tau_b$ for the anagrams are .78 and .61, while the values for the information items
are, respectively, .68 and .57. Such decreases in reliability have been shown,
in Monte Carlo evaluations of the group testing program, to have disproportionate
effects on the reliability of tailored tests compared to the decrease in complete
test reliability (Cliff, Cudeck and McCormick, Note 4). Regression equations
calculated from the Monte Carlo data can be used to predict tailored test
reliability from complete test reliability. Given complete test $r = .78$ and
$\tau_b = .61$ for the first study, the regression based on the group testing Monte
Carlo predicts tailored $r = .66$ and $\tau_b = .49$. The values obtained from
individual live testing are $r = .83$ and $\tau_b = .61$. For the anagram data, then, the
obtained results are considerably above the predicted values. If the computations
are repeated for the information items, which have complete test reliabilities
$r = .68$ and $\tau_b = .57$, the predicted values are $r = .50$ and $\tau_b = .49$. The
obtained values are $r = .26$ and $\tau_b = .19$. The obtained values for the less
reliable information items are, as shown, much lower than the regression would
predict.

TABLE 4: Correlations from z score data

                                                              r       n

z score and serial position
    A item pool                                             -.161     34
    B item pool                                             -.045     34
    Combined data                                           -.103     68

z score and no. of items taken
    A items                                                 -.349 *   34
    B items                                                 -.180     34
    Combined data                                           -.265 *   66

Absolute z scores and serial position
    A items                                                  .014     34
    B items                                                 -.035     34
    Combined data                                           -.010     68

Absolute z scores and no. of items taken
    A items                                                  .002     34
    B items                                                 -.052     34
    Combined data                                           -.024     68

Absolute difference in z scores and serial position         -.124     34

Absolute difference in z scores and no. of items taken       .090     34

* denotes alpha less than .05

It is possible that these two points represent only a change in the slope
of the relationship between complete and tailored reliability, but it seems
unlikely that such a sharp drop in tailored reliability would occur between
$r = .68$ and $r = .78$ complete test reliability without the additional burdens
of the guessing probability of .25 and a very heterogeneous item pool. Until
additional experience accumulates, such speculation cannot be confirmed.

Among the secondary analyses, a significant negative correlation was
observed between number of items taken and the examinee's z score. There are two
characteristics of the data which make this result plausible. First, many of
the brighter examinees took fewer items simply because of the ceiling of
difficulty. Secondly, the items of lesser difficulty appear to be closer together
on the continuum of ability-difficulty, and therefore the program found more
items at the ability level of the duller subjects.

The average values of the consistency index $c$ found here for both tailored and
complete data are lower than those found for the first live testing data. The
averages for the first study were .16 and .42 for complete and tailored data.
For the current data they are, respectively, .12 and .22.

Although TAILOR-APL failed to make successful use of the greater consistency
in the tailored data, the existence of an index of tailored consistency may
help to improve future tests. In spite of the fact that woefully unreliable
scores were obtained in this study, the program presented an average of only
.41 of the items to each person. If the stringency of the statistical rules
for establishing dominance relations were made to vary with $c$, then perhaps
when the data reached this low level of consistency, more relations could be
required from each subject.

There are two characteristics of these items which may make them rather
inappropriate for tailoring. First, they are moderately heterogeneous in subject
matter, perhaps particularly in a college population where individuals have had
varying amounts of formal course work in these subject areas. Thus, they may
form subscales, and implications based on relative difficulty are subject
to error.

Second, these are multiple choice items. The implied orders system is likely
to be particularly vulnerable to the errors introduced by correct guesses. Some
confirmation of this is found in Figure 5. Note first the band of "transition"
responses, mixtures of both correct and incorrect, along the main diagonal. There
is a considerable scattering of "boxes" (representing correct responses) in the
lower left portion, fairly far from this diagonal, representing very
inconsistent correct responses. In contrast, there are only a few stars in the upper
right, representing very inconsistent errors. Thus, this item pool may
represent one which is inappropriate for tailoring.

The fact that more consistent data is obtained in tailored tests than
complete tests raises the interesting possibility that test items are becoming
more discriminating through some feature of tailoring. This means not only that
items are not independent units with parameters that characterize them
regardless of their surroundings, but also that item orders can be optimized to
increase the efficiency of testing beyond the more generally expected effects
of tailoring.

If increases in consistency continue to be observed when tests are
tailored, explanation of this and optimization to make use of it will need to be
investigated, both from a practical viewpoint and to gain a better understanding
of the psychological process of problem-solving.


Group  Testing  Approaches 

Overview 

The group testing method of implied orders testing has been used in a
variety of Monte Carlo and simulation studies. During the past year we have
also tested several hundred introductory psychology students on three
different batteries of tests, and have developed two separate versions, one in
FORTRAN for the IBM 370, and another in APL at the Berkeley Management Science
Center (MSC). In this section the current thinking regarding group testing
approaches is presented. In general we conclude that, although the original
presentation of the model has been in the group testing style, this approach
has several serious practical limitations. A second-generation strategy is
then outlined which incorporates the best of the individual
and group methods into one approach which can be used in a variety of settings.

The  Group  Testing  Algorithm 

The transition from a Monte Carlo program to a program for actual on-line
testing occurred in a straightforward manner. During the extensive series of
Monte Carlo runs it was assumed that each simulated subject would receive one
item each round until the completion of the test. If two or more batches of
items were assessed in a single simulation, then all previous
tests had to be completed before subsequent tests began.

The  experience  at  the  MSC  was  moderately  encouraging  with  respect  to  this 
system  except  that  the  mini-computers  which  form  the  basis  of  this  installation 
were  much  too  slow  to  provide  a reasonable  test  of  the  process.  However,  it 
did  appear  to  have  the  serious  drawback  that  it  did  not  allow  subjects  to 
progress  at  different  rates. 

Therefore, the program was revised to follow the diagram given in Figure 6.
If five subjects took a test, for example, six operating programs were required,
five of which were for the individual examinees, while the sixth was for the
supervisor routine. These routines were designated INDTEST and MONITOR. INDTEST
performed simple tasks such as displaying an item at a subject's terminal,
scoring the response and communicating with MONITOR. MONITOR performed all
the computations of the implied orders model and in addition policed the several
individual sessions. The most crucial aspect of this system was the approach
for communication between MONITOR and each INDTEST. From the perspective of
INDTEST this meant that a subject could only answer an item when MONITOR
supplied one; on the other hand, it was only efficient for MONITOR to proceed
when approximately half the subjects had responded.

In practice this approach still translated into a large amount of wait time
for all concerned, and although college students are fairly patient people, we
have become extremely dissatisfied with it. Furthermore, the entire system is
interlocked, such that if one computer terminal fails or one subject accidentally
presses a wrong button, the whole session halts. The amount of wait time
and the number of restarts after a terminal failure have indicated two things
about this system. First, from a practical point of view, the current
implementation for group testing is basically suboptimal. It uses an unrealistic
assumption about how subjects behave, for no group of individuals ever
responds at exactly the same rate. It is also inefficient in terms of computer
terminals because one machine must be reserved as a port for the MONITOR.
Second, because there has been so much procedural manipulation and subject wait
time, we feel that the group testing results which we have obtained on this system
with human subjects are, unfortunately, misleading, except for some general
procedural conclusions.
Although the above difficulties indicate that the recent research with the
group approach has severe limitations, we cannot resist discussion of a
consistently positive finding during the last year. In previous reports of the
individual testing approach it was noted that a few complete tests must be given before
a reduction in the number of items occurs, but that later subjects take tests
of increasingly shorter lengths. The same pattern has been observed with the
current group strategy; however, the saving in items begins with the very first
group. Figure 7 indicates the extent to which this is possible. The plot
labeled Group 1 shows the percentage of implications as a function of the
percentage of responses for the first 5 subjects, using a well known test
battery. The Group 12 plot shows how the number of responses decreases as a
function of the number of previously tested subjects. In this instance, 46
subjects were tested prior to the final group of 5 subjects in Group 12. As can
be seen, the saving is dramatic. The final group used only 37% of the items
on the average. This figure compares favorably with previous findings from the
individual testing data, and in fact is similar to the Monte Carlo studies for
approximately the same numbers of persons and items (see Cudeck, et al., Note 5).

We have benefited greatly from the information these practical considerations
have provided, although the loss of statistical evaluation for the method is
disappointing.

These concerns have resulted in a hybrid concept of the way to carry out
implied orders testing. Although no working program has been written yet, such
a routine will probably be developed in the future. It is basically a
modification of the individual testing strategy, with some aspects of the supervision
of the group approach. It is pictured schematically in Figure 8. Each individual
program just makes a copy of the current contents of the integer item dominance
matrix, while the supervisor prohibits one program from reading the data when
another is simultaneously writing it back. The testing proceeds in a manner
described by McCormick (Note 6) until the end of the test. At that time, the
supervisor again controls the use of the common integer item dominance matrix so
that individual programs record their data one at a time. This simple
modification represents the best of each method used so far. The result is that
subjects can work at their own pace, the number of persons being tested is, as in the
group procedure, limited only by the number of available terminals, and there
is no need for a separate MONITOR session. Furthermore, it appears possible
to write such an algorithm in either APL or FORTRAN on the current USC computer
system. The modification to either program would be straightforward.
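
The hybrid arrangement described above can be illustrated with a minimal sketch. It is written in Python rather than APL or FORTRAN, and everything in it is an assumption made for illustration: the class name `SharedDominance`, the function `run_session`, and the use of a thread lock to play the supervisor's role are hypothetical and do not describe any program written for this project. Each session copies the common integer item dominance matrix under the lock, works from its private copy, and writes its responses back once at the end of the test.

```python
import copy
import threading


class SharedDominance:
    """Common integer item dominance matrix; the lock plays the supervisor's role."""

    def __init__(self, n_items):
        self._lock = threading.Lock()
        self._matrix = [[0] * n_items for _ in range(n_items)]

    def snapshot(self):
        """Return a private copy; blocked while another session is writing."""
        with self._lock:
            return copy.deepcopy(self._matrix)

    def record(self, responses):
        """Add one finished examinee; `responses` maps item index -> 0 or 1."""
        with self._lock:
            for j, score_j in responses.items():
                for k, score_k in responses.items():
                    if score_j == 0 and score_k == 1:
                        self._matrix[j][k] += 1  # item j wrong, item k right


def run_session(shared, responses):
    local_copy = shared.snapshot()  # tailoring decisions would consult this copy
    shared.record(responses)        # data are written back once, at the end
    return local_copy


shared = SharedDominance(n_items=3)
examinees = [{0: 1, 1: 1, 2: 1}, {0: 0, 1: 1, 2: 1}, {0: 0, 2: 1}, {0: 0, 1: 0, 2: 0}]
threads = [threading.Thread(target=run_session, args=(shared, r)) for r in examinees]
for t in threads:
    t.start()
for t in threads:
    t.join()
final = shared.snapshot()
```

Because every write happens under the lock, the final counts do not depend on the order in which sessions finish, which is the property that lets subjects work at their own pace.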


FIGURE  7:  A comparison  of  percentage  of  implications  as  a 
function  of  responses  for  the  first  and  twelfth 
IV. CONCLUSION: PROGRESS WITH IMPLIED ORDERS TAILORED TESTING

In  this  section,  we  present  a final  summary  of  the  current  outlook  for 
Implied  Orders  tailored  testing.  It  is  difficult  to  be  definitive  in  some 
areas  because  so  much  work  remains  to  be  done,  or  because  the  information 
which is available is somewhat conflicting. On the other hand, several
findings have been replicated so often that we feel quite confident in making
a general statement.


Evaluation  of  the  Testing  Algorithm 

The  Model 

The implied orders concept which forms the basis for the TAILOR procedure
takes the Guttman Scale as a prototype. It assumes that the goal of a test is
to order the persons with respect to each other and to the items, as sketched
in the introductory remarks of this report and elaborated earlier (Cliff, 1975;
Cudeck, et al, Note 5; Cliff, et al, Note 4). No meaningfully large collection
of data from persons and items conforms exactly to this ideal, however, and
the data always contain inconsistency which must be allowed for.

The primary procedure for adjusting the model to the realities of the data
is an approximate significance test. In a Guttman scale, item j is easier than
k if there are more people who get j right and k wrong than the reverse.
Similarly, in an error-free tailored test, person i is implied to answer item j
correctly if he has gotten harder items correct. With real data, we substitute
statistical comparisons that say item j is easier than k if "significantly more"
persons get j right and k wrong than the reverse, and person i is implied to
get item j correct if he has answered significantly more harder items correctly
than he has answered easier items incorrectly. The early tinkering which
was a very time-consuming part of this research showed that the optimal results
occurred when the criterion for significance was set very low, provided
certain safeguards (the "probability test" illustrated in McCormick, Note 6)
were included.
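
The person-item rule just stated can be sketched as follows. The report does not give the form of the approximate significance test, so the sign-test style statistic (a - b) / sqrt(a + b), the default criterion `z_crit = 1.0`, and the function names are all assumptions made for illustration; the "probability test" safeguard is not reproduced.

```python
from math import sqrt


def significantly_more(a, b, z_crit=1.0):
    """Assumed rule: count a exceeds count b by a sign-test style margin."""
    total = a + b
    return total > 0 and (a - b) / sqrt(total) > z_crit


def implied_correct(responses, easiest_to_hardest, item, z_crit=1.0):
    """Is `item` implied correct for a person, given the responses so far?

    `responses` maps answered items to 0 or 1; `easiest_to_hardest` is the
    current item order (hypothetical inputs for this sketch).
    """
    pos = easiest_to_hardest.index(item)
    harder_right = sum(1 for it in easiest_to_hardest[pos + 1:] if responses.get(it) == 1)
    easier_wrong = sum(1 for it in easiest_to_hardest[:pos] if responses.get(it) == 0)
    return significantly_more(harder_right, easier_wrong, z_crit)


order = ["a", "b", "c", "d", "e"]
clear_case = implied_correct({"c": 1, "d": 1, "e": 1}, order, "b")
mixed_case = implied_correct({"a": 0, "d": 1}, order, "b")
```

In the first call three harder items are correct and no easier item is wrong, so the implication is made; in the second the evidence is balanced and the item would still have to be presented.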


Data  on  Implied  Orders 

The conclusion that seems justified from the experience to date is that
such a model, bolstered by a heuristic device to absorb error, i.e., the
significance test, works very well provided the data are highly consistent. The
basis for this conclusion is in the Monte Carlo studies and the anagram data
from TAILOR-APL.
The extensive Monte Carlo study, supported by the simulation with the file
of Binet data (Cliff, et al, Note 4), showed that TAILOR scores would be more
reliable than complete test scores based on the same number of items. This
superiority would be appreciable if the items had very high discrimination
indices, and negligible if they were moderate. This finding was bolstered
by the startling results of the first anagram study (McCormick, Note 6). This
study found, in two separate replications, that TAILOR scores based on less than half
the items were more reliable than total test scores, although not significantly
so. Examination of the findings suggested that these data matrices were highly
consistent,  particularly  in  the  case  of  the  tailored  data.  The  apparent  fact 
that  the  tailored  responses  were  more  consistent  than  conventional  responses 
to  the  same  items  was  interesting  in  its  own  right  and  led  to  optimism  that 
the  TAILOR  system  could  be  readily  made  the  basis  for  an  operating  system. 

The second, subsequent real-subject trial reported here was not encouraging.
It reported lower reliability for the TAILOR scores, and such results would in
general be regarded as unsatisfactory for operational use. Reasons for this
finding have been presented in the description of the study. The basic one
is perhaps simply that TAILOR is sensitive to the degree of consistency of
the data, and this test was simply not internally consistent enough for the
TAILOR process to be effective. Possible multifactor structure and the
presence of guessing are likely causes of the inconsistency.

Thus, the two tryouts with real data and the Binet simulation (Cliff, et al,
Note 2) show that our procedure can work as a tailoring system.

Computer  Programs 

The  original  idea  for  TAILOR  was  that  the  testing  process  would  involve 
a group  of  subjects  being  tested  at  the  same  rate,  in  rounds,  as  it  were.  This 
is  clearly  inefficient  since  some  subjects  work  faster  than  others  and  some  will 
require  more  items  than  others.  It  is  preferable  that  they  be  able  to  work  at 
their  own  rate,  or  at  least  at  one  of  several  different  rates.  This  is,  of  course, 
the way the individual program operates. However, in that program only one
subject can be tested at a time; information from examinees is part of the
data base only after they have completed the test. This too is undesirable.

An optimum version of the program is being contemplated which will in
effect store the data from all subjects centrally, update it as it comes in from
subjects, and handle several subjects at the same time.

The centrally important data in the implied orders system is the matrix
which we call the item dominance matrix. It is an item by item matrix, denoted
N in the theoretical articles (Cliff, 1975), and it has two forms, integer and
binary (dichotomous). The integer form records in cell j,k the number of persons
who get j wrong and k right. The binary form contains a 1 in the corresponding
element if "significantly" more persons got j wrong and k right than the reverse.
The significance test converts the integer information into the binary form in
the earlier versions of the program. The proposed version stores both.
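
As an illustration of the two forms, the following sketch builds both from a small persons by items matrix of 0/1 scores. The integer form follows the definition given above. The conversion to the binary form is only a stand-in: the statistic (n_jk - n_kj) / sqrt(n_jk + n_kj) and the criterion `z_crit` are assumptions, since the report does not state the test used.

```python
from math import sqrt


def integer_dominance(S):
    """Cell [j][k] counts the persons who got item j wrong and item k right."""
    n_items = len(S[0])
    N = [[0] * n_items for _ in range(n_items)]
    for row in S:  # one person's 0/1 responses
        for j in range(n_items):
            if row[j] == 0:
                for k in range(n_items):
                    if row[k] == 1:
                        N[j][k] += 1
    return N


def binary_dominance(N, z_crit=1.0):
    """Cell [j][k] is 1 when n_jk exceeds n_kj by the assumed margin."""
    n_items = len(N)
    B = [[0] * n_items for _ in range(n_items)]
    for j in range(n_items):
        for k in range(n_items):
            total = N[j][k] + N[k][j]
            if total > 0 and (N[j][k] - N[k][j]) / sqrt(total) > z_crit:
                B[j][k] = 1
    return B


# Four persons, three items ordered hardest to easiest (a perfect Guttman pattern).
S = [[1, 1, 1], [0, 1, 1], [0, 0, 1], [0, 0, 0]]
N = integer_dominance(S)
B = binary_dominance(N)
```

With so few persons only the most widely separated pair of items reaches the assumed criterion, which shows why the stringency of the rule matters when data are sparse.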

The current version of the program calculates the integer dominance matrix
as a matrix multiplication, after each round in the group version and after
each person in the individual one. This is a very substantial calculation,
even by the method we use. However, the response of a person to an item can
only alter elements in one row or column of the matrix. Updating the matrix
for a single response is therefore at worst a singly subscripted loop, rather
than a triply subscripted one, and so it can take place with great rapidity,
enabling the program to handle inputs from a number of terminals. The program
would operate on an as-needed basis, so examinees could be taking the test
either at the same time (up to the limits of the terminal-monitoring
capability) or at different times. The program
will tailor on the basis of the information that it has at that moment.
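
The single-loop update can be sketched as below, under the same convention as the text (cell j,k counts persons with item j wrong and item k right). The function name `record_response` and the dictionary `answered` are hypothetical. A correct response touches only the column of the new item, and an incorrect response touches only its row.

```python
def record_response(N, answered, item, correct):
    """Update N in place for one new response by one person.

    `answered` maps the items this person has already taken to 0 or 1.
    """
    for other, score in answered.items():
        if correct and score == 0:
            N[other][item] += 1  # other wrong, new item right: column `item`
        elif not correct and score == 1:
            N[item][other] += 1  # new item wrong, other right: row `item`
    answered[item] = 1 if correct else 0


N_incremental = [[0] * 3 for _ in range(3)]
for row in [[1, 1, 1], [0, 1, 1], [0, 0, 1], [0, 0, 0]]:
    answered = {}
    for item, score in enumerate(row):
        record_response(N_incremental, answered, item, bool(score))
```

Applied response by response, the update arrives at the same counts that a full recomputation from the score matrix would give, at a cost proportional to the number of items the person has already answered.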

One  final  note  may  be  made  regarding  data  consistency  and  program  operation. 
As  currently  written,  TAILOR  makes  three  kinds  of  implications  and  the  decision 
rule in each instance is the same. One possible modification may be to alter
the rules used according to the task, and in addition, to alter each rule
according to the quality of the data. Thus if the items in a particular test are
consistent,  the  requirements  for  an  implication  to  be  made  could  be  relaxed.  If 
the  data  are  inconsistent,  the  decision  rules  could  be  made  more  stringent.  As 
regards  the  kind  of  implication  being  made,  it  appears  likely  that  item-item 
dominances  which  are  based  on  many  responses  suggest  a lenient  decision  criterion 
because  the  weight  of  evidence  should  serve  to  make  any  implication  probably 
correct.  Item-person  implications  may  require  a stringent  rule  due  to  the  fact 
that  much  less  information  exists  for  the  subjects  and  therefore  any  implications 
are  more  probabilistic. 

Thus  two  different  kinds  of  significance  may  be  required,  and  they  might 
be  differently  affected  by  the  level  of  consistency  of  the  data.  Neither  change 
would  complicate  the  operation  of  the  program  very  much,  and  they  would  hardly 
affect running speed. What they do require is research, particularly
theoretical research, to establish the relation between the degree of consistency and
optimum degree of tailoring.
Contributions to Test Theory

Consistency  Measures 

This research program also produced significant conceptual developments.
The consistency measures proposed by Cliff (1975, 1977) provided a method of
reconceptualizing test item consistency in terms of dominance concepts that
are perhaps more valid to testing than are the traditional correlational ones.
More directly important for tailored testing was the fact that these
consistency measures were structured in such a way as to make them applicable to
tailored tests.

Only recently has it been possible to incorporate these indices into the
TAILOR program in order to monitor the consistency of the data. The final
study reported above makes use of this information, but what is still
developing is an intuitive feel for the magnitude of the numbers and what a given
consistency value is likely to mean in terms of the efficiency of the process
of tailoring. We anticipate that other workers will begin to reference these
indices as means of judging consistency. One difficulty may be that while
they are very plausibly defined, they appear as startlingly low numbers to one
used to the usual reliability coefficients.

Defining  Subpools 

The use of these coefficients to subdivide an item pool into homogeneous
subsets was pioneered by Reynolds (Note 7). This proved very effective in his
case, and these methods would probably be fruitful to pursue as a means of
"factoring" dichotomous items. This is a direction that we intend to follow up.

This development should be facilitated by an extension of the
theoretical concepts which has been made only recently. Our work draws heavily on
the concepts of dominance: item-person dominance, item-item dominance, and
person-person dominance. The recent development is in essence a way of
joining the concept of dominance to the traditional one of correlation.
Relations on items may be redundant to relations on other items, contradictory to
relations on other items, or unique to these items. Statistical definitions of
these concepts allow one to consider dominance and correlation simultaneously.
This fact suggests a means of measuring not just consistency but effectiveness
of a pool of items, and also a method for dividing items into sub-pools of
maximum effectiveness. Many details remain to be worked out, but the direction seems
very promising and some of it has been presented at the Psychometrics Meeting
in Uppsala. The measures suggested by these methods, too, generalize quite
straightforwardly to the tailored case.


Research  Methodology  in  Tailored  Testing 


If one is developing a tailored testing system, one would of course like to
know how well it works. One of the contributions of this project is felt to be
its clear definitions of tailored test effectiveness. These are not unusually
subtle, but they are somewhat different from those used elsewhere. They follow the
elementary design principles of not confounding the dependent with the independent
variables, and not confounding independent variables with each other.

The  two  traditional  qualities  used  to  judge  a psychometric  variable  are 
reliability  and  validity.  Thus  we  have  chosen  to  use  as  our  major  dependent 
variables  the  correlation  of  a tailored  score  with  a parallel  score  or  with  a 
true  score.  Moreover,  we  have  avoided  the  experimental  confounding  of  tailored 
scores  with  either  one.  Thus,  we  derived  in  the  real  data  studies  two  parallel 
tailored  scores  by  independently  tailoring  two  subtests  and  correlating  the  scores 
on  the  two.  This  practice  avoids  the  confounding  introduced  by  such  practices  as 
correlating  a tailored  score  with  a conventional  score  on  the  same  items,  where 
responses  to  items  enter  into  the  determination  of  both  scores.  Also,  in  our 
Monte  Carlo  studies  where  a true-score  was  used,  the  latent  parameters  are  used 
to  generate  item  scores  only.  The  tailored  test  scores  can  then  be  correlated 
with  the  true  scores  to  determine  validity,  and  this  validity  is  not  confounded 
by  any  experimental  dependence  between  the  two. 
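
The parallel-forms criterion amounts to correlating two independently obtained sets of scores for the same persons. The sketch below computes the two coefficients reported in this study, the Pearson r and, on the assumption that the second coefficient is Kendall's tau-b, that rank coefficient as well. The function names and the score vectors are illustrative only and are not data from the project.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equally long score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def kendall_tau_b(x, y):
    """Kendall's tau-b: (concordant - discordant) pairs, corrected for ties."""
    concordant = discordant = tied_x = tied_y = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0:
                tied_x += 1
            if dy == 0:
                tied_y += 1
            if dx * dy > 0:
                concordant += 1
            elif dx * dy < 0:
                discordant += 1
    n_pairs = n * (n - 1) // 2
    return (concordant - discordant) / sqrt((n_pairs - tied_x) * (n_pairs - tied_y))


# Hypothetical tailored scores of five persons on two independently tailored subtests.
scores_a = [1, 2, 3, 4, 5]
scores_b = [2, 1, 4, 3, 5]
r = pearson_r(scores_a, scores_b)
tau = kendall_tau_b(scores_a, scores_b)
```

Because the two score lists come from separate item pools, no response enters both scores, which is the point of the design described above.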

The  use  of  scores  on  randomly  parallel  tailored  tests  and  randomly  parallel 
complete  tests  to  determine  reliability  also  avoids  the  possible  confounding  of 
item  quality  with  tailoring.  This  can  occur  when  "tailoring"  includes  winnowing 
an  item  pool  to  select  the  best  items.  Here,  measurement  of  efficiency  of  the 
tailoring  process  is  confounded  with  the  effects  of  using  the  more  effective 
items.  An  appropriate  question  in  evaluating  a tailored  process  is  "How  valid 
is  a score  derived  from  tailoring  this  item  pool  compared  to  using  the  complete 
pool?" 

The model used here automatically circumvents an inferential problem which
troubles those tailoring procedures which estimate a true score and use an
information measure as the definition of the effectiveness of the testing process.
Aside from the fact that the item statistics used in this are at best estimates,
not the true values of the item parameters, it appears likely that the items
may behave differently in the tailored and untailored contexts; therefore, item
statistics derived from the one are not appropriate for exact estimation in
the other. In any event, where such procedures are used, the criterion of
correlation with separately and independently tailored tests should be used
with real data evaluations. In Monte Carlo evaluations with an estimation
process, one should be sure to use sample estimates of item parameters, not the
population values, and actually correlate the estimated scores with true
scores, not rely on information measures.


Future  Prospects 


Tailored  Testing 

It is perhaps surprising that the implied orders system has worked at all.
How well it works compared to other tailoring systems is hard to say since they
have rarely provided the kind of data we feel is necessary for evaluation. It
may be that no tailoring system will always work well except perhaps in a
context which justifies a large investment in item development and pretesting.
However, it seems that it should also work in situations where goals are such
things as placement in a training sequence (i.e., the ability scale is large)
or multifactor testing where one is primarily interested in identifying extremes
on these factors. In both of these situations the TAILOR program or a descendant
of it would have an important place.

The suggested form of an operational program has been sketched earlier.
It also seems that it might be desirable to incorporate response-timing
features into the program so that in effect each item acts as several. Also,
where multiple choice responses are used, it might be useful for each alternative
to be scored separately.

Ordinal  Test  Theory 

The concepts used in this research have provided a basis for the
measurement of consistency of items which is measured directly from the
inter-relationships among the items. Thus tailored data can be evaluated as well as
conventional. As indicated above, it appears that several new directions can be
explored from these bases, and they are likely to be particularly useful
for tailored testing. They also suggest new bases for determining the
multifactor structure of data.
APPENDIX


Method for computing $c_{t2}$ and $c_{t3}$

In the original presentation of these measures, $c_{t2}$ is derived in a manner
similar to many other psychometric devices in which the statistic is a ratio
of explained variance to total variance. That is, $c_{t2}$ uses two kinds of
information from the ordered integer item dominance matrix, N = S'S, where S
is as usual the n persons by x items score matrix. Then V, the total number
of dominances, is

$$V = \sum_{j=1}^{x} \sum_{k=1}^{x} n_{jk}$$

and $V_m$, the number of persons who respond in accordance with the order, is

$$V_m = \sum_{j=1}^{x-1} \sum_{k=j+1}^{x} |n_{jk} - n_{kj}| .$$


The upper triangular portion of N will contain all the nonzero entries when S
displays a perfect order. Thus $V_m/V$ will be unity for data which conform to a
Guttman scale, but zero for a random response pattern. Actually $c_{t2}$ is
modified to take on values in the range $-1 \le c_{t2} \le 1$ by the linear
transformation

$$c_{t2} = 2(V_m/V) - 1 .$$


As can be seen, $c_{t2}$ is similar in intent to a percentage agreement between
several judges. It is equivalent to an index from Loevinger (1947) which was
devised to assess homogeneous tests, a concept very closely related to the
present development.
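
The computation of $c_{t2}$ can be sketched in a few lines of Python. This is an
illustrative sketch only, not the project's program. The function names are
hypothetical, and it assumes that $n_{jk}$ counts the persons who pass item j
and fail item k, which is what the zero diagonal and the upper-triangular
property described above suggest. The data are invented for illustration.

```python
def dominance_matrix(S):
    """Return N, where N[j][k] counts persons who pass item j and fail item k.

    S is a list of person rows, each a list of 0/1 item scores.
    The pass-j/fail-k reading of n_jk is an assumption of this sketch.
    """
    x = len(S[0])
    return [[sum(row[j] * (1 - row[k]) for row in S) for k in range(x)]
            for j in range(x)]


def c_t2(N):
    """Compute c_t2 = 2 * (V_m / V) - 1 from an item dominance matrix N."""
    x = len(N)
    V = sum(sum(row) for row in N)  # total number of dominances
    if V == 0:
        raise ValueError("no dominances: c_t2 is undefined")
    # V_m: absolute difference between the two directions of each item pair
    V_m = sum(abs(N[j][k] - N[k][j]) for j in range(x) for k in range(j + 1, x))
    return 2 * V_m / V - 1


# Hypothetical data: four persons, three items forming a perfect Guttman scale.
guttman = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
print(c_t2(dominance_matrix(guttman)))  # 1.0
```

For the perfect scale every dominance lies in the upper triangle, so $V_m = V$
and the index reaches its maximum of 1.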

In contrast, $c_{t3}$ relies on the consideration of marginal scores to determine
consistency. When items are highly discriminating and have a broad range of
difficulties, their marginal scores will reflect this condition, and one may
expect highly consistent data for this reason alone. Thus the intention of
$c_{t3}$ is to correct for spuriously high consistency due to differences in
difficulty. In that manner it resembles Cohen's kappa (Cohen, 1960) by
correcting in accordance with the marginals. This adjustment is

$$c_{t3} = \frac{V_c - V}{V_c - V_m} ,$$

where

$$V_c = \sum_{j} \sum_{k \ne j} \hat{n}_{jk}$$

and

$$\hat{n}_{jk} = E(n_{jk}) = \frac{w_j (n - w_k)}{n} ,$$

with

$$w_j = \sum_{i=1}^{n} s_{ij} .$$
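
A chance-corrected index of this kind might be computed as in the Python sketch
below. It is a sketch under stated assumptions, not a definitive
implementation: it assumes the adjustment takes the form
$c_{t3} = (V_c - V)/(V_c - V_m)$, with $V_c$ the sum of the expected dominances
$w_j (n - w_k)/n$ over all pairs with j not equal to k, and with $n_{jk}$ read
as the number of persons who pass item j and fail item k. The function name and
the data are hypothetical.

```python
def c_t3(S):
    """Chance-corrected consistency, assuming c_t3 = (V_c - V) / (V_c - V_m)."""
    n, x = len(S), len(S[0])
    w = [sum(row[j] for row in S) for j in range(x)]  # item marginals w_j
    pairs = [(j, k) for j in range(x) for k in range(x) if j != k]
    # Observed dominances n_jk: persons who pass item j and fail item k (assumed).
    N = {(j, k): sum(row[j] * (1 - row[k]) for row in S) for j, k in pairs}
    V = sum(N.values())
    V_m = sum(abs(N[j, k] - N[k, j]) for j, k in pairs if j < k)
    # Expected dominances under independence: w_j * (n - w_k) / n.
    V_c = sum(w[j] * (n - w[k]) / n for j, k in pairs)
    if V_c == V_m:
        raise ValueError("c_t3 is undefined when V_c equals V_m")
    return (V_c - V) / (V_c - V_m)


# Hypothetical data: a perfect Guttman scale and two statistically independent items.
guttman = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
independent = [[1, 1], [1, 0], [0, 1], [0, 0]]
print(c_t3(guttman))      # 1.0
print(c_t3(independent))  # 0.0
```

Under these assumptions the index is 1 when the observed dominances equal the
minimum the marginals allow, and 0 when they equal the number expected by
chance.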


The  following  detailed  example  is  from  Cliff  (1977,  p.  377)  with  inconsistent 
complete  data. 

Step 1. From S, compute S', N and $w_j$.

Step 2. Using N, compute V and $V_m$.

$$V = 0 + 2 + 2 + 2 + 1 + 0 + \dots + 2 + 1 + 0 = 17 .$$

$$V_m = (2 - 1) + (2 - 1) + (2 - 1) + (1 - 1) + (2 - 2) + (1 - 1) = 3 .$$


Step 3. Compute $c_{t2}$.

$$c_{t2} = 2(V_m/V) - 1 = 2(3/17) - 1 = -.647$$
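
The arithmetic of Steps 2 and 3 can be checked with a short Python sketch. The
matrix below is an assumption: its entries are inferred from the terms printed
in Step 2, since the dominance matrix itself is not reproduced here.

```python
# Item dominance counts inferred from the sums printed in Step 2 (an assumption:
# the source matrix itself is not reproduced here).
N = [
    [0, 2, 2, 2],
    [1, 0, 1, 2],
    [1, 1, 0, 1],
    [1, 2, 1, 0],
]
x = len(N)
V = sum(sum(row) for row in N)  # total number of dominances
# V_m: absolute difference between the two directions of each item pair
V_m = sum(abs(N[j][k] - N[k][j]) for j in range(x) for k in range(j + 1, x))
c_t2 = 2 * V_m / V - 1
print(V, V_m, round(c_t2, 3))  # 17 3 -0.647
```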


Step 4. From the marginal scores, compute the expectation for N,

$$\hat{n}_{jk} = \frac{w_j (n - w_k)}{n}$$